W2 Lab Assignment

The Internet Movie Database (IMDb) provides a variety of information about movies, such as total budgets, lengths, actors, and user ratings. The data are publicly available from here. In this lab, let's explore a processed dataset named 'imdb.csv', which contains some basic information about movies.

Download the file from Canvas. There are 4 columns separated by tab:

Title: title of the movie;
Year: release year;
Rating: average IMDb user rating;
Votes: number of IMDb users who rated this movie.

First, we want to get some insights from the data with Python; then we want to display information on a web page and prettify it with html/css.

Things to note:

Let's use Python 3.5;
There are 313,012 lines in the file. When printing things, print selectively.

Part 1. Data manipulation with Python

Q1: What are the first and last years in this dataset? How many movies were released in each year?

To do this, we first need to read the CSV file. Python provides the csv module to read and write CSV files. The csv.reader function returns an object that iterates over the lines of the given file. Each line is returned as a list of strings, so we can access a particular column with a list index. If we want to ignore the first line, we can use islice from the itertools module. It works like slicing a list, but it can slice any iterator (e.g. a file stream). For instance, islice(reader, 0, 5) means "give me the first 5 items from the reader", and islice(reader, 1, 5) means "give me the 4 items starting from the second item".

A basic usage example to read the first 5 lines of 'imdb.csv':



In [20]:

    
import csv
from itertools import islice

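# The file is tab-separated; 'with' closes it automatically when the block ends.
with open('imdb.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in islice(reader, 0, 5):  # first 5 lines, header included
        print(row)     # the whole line as a list of strings
        print(row[1])  # the Year column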









    



['Title', 'Year', 'Rating', 'Votes']
Year
['!Next?', '1994', '5.4', '5']
1994
['#1 Single', '2006', '6.1', '61']
2006
['#7DaysLater', '2013', '7.1', '14']
2013
['#Bikerlive', '2014', '6.8', '11']
2014
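The example above prints the header line along with the data. As a minimal sketch of skipping it, the snippet below starts the slice at index 1. To stay self-contained it reads from a small in-memory sample (the hypothetical variable `sample`) rather than from 'imdb.csv'; the rows are copied from the output above.

```python
import csv
import io
from itertools import islice

# Hypothetical in-memory stand-in for 'imdb.csv' (tab-separated, with a header line).
sample = io.StringIO(
    "Title\tYear\tRating\tVotes\n"
    "!Next?\t1994\t5.4\t5\n"
    "#1 Single\t2006\t6.1\t61\n"
    "#7DaysLater\t2013\t7.1\t14\n"
)

reader = csv.reader(sample, delimiter='\t')

# Start at index 1 to skip the header; a stop of None means "read to the end".
rows = list(islice(reader, 1, None))

for title, year, rating, votes in rows:
    # Every field arrives as a string, so convert before doing arithmetic.
    print(title, int(year), float(rating), int(votes))
```

With the real file, replacing `sample` with an opened file object should be the only change needed.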

There are many ways to do Q1. One way is to use dictionaries where the key: value pairs are:

key: year
value: a list of movie titles or number of movies



In [2]:

    
dt = {}
year = 1972
if year not in dt:
    dt[year] = 1
else:
    dt[year] += 1
print(dt)

The Counter class in the collections module automates the job above: a key that has not been seen yet simply counts as 0.



In [3]:

    
from collections import Counter

movie_counter = Counter()
movie_counter[1972] += 1
print(movie_counter[1972])
print(movie_counter[1970])

1
0

Once all lines are read, we want to print the dictionary, which can be done by iterating over its key: value pairs.



In [4]:

    
for key,val in dt.items():
    print(key,val)
for key,val in movie_counter.items():
    print(key,val)

You can get the keys (the years) by using the .keys() method.



In [5]:

    
movie_counter[1980] += 5
movie_counter[2015] += 1
movie_counter.keys()









    Out[5]:





dict_keys([1980, 1972, 2015])

You also have convenient functions like min() and max() for finding the smallest and largest value of a list or other iterable.



In [6]:

    
alist = [23,3,5,4,2,1,1,0,1000]
print(min(alist))
print(max(alist))
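Putting the pieces above together, one possible way to answer Q1 with only csv, islice, and Counter is sketched below. It is not the only approach, and it reads from a hypothetical in-memory sample (`sample`) so that it runs on its own; in the lab, the opened 'imdb.csv' file would take its place.

```python
import csv
import io
from collections import Counter
from itertools import islice

# Hypothetical sample rows taken from the dataset; not the full file.
sample = io.StringIO(
    "Title\tYear\tRating\tVotes\n"
    "!Next?\t1994\t5.4\t5\n"
    "#1 Single\t2006\t6.1\t61\n"
    "Pulp Fiction\t1994\t8.9\t1177471\n"
)

year_counter = Counter()
for row in islice(csv.reader(sample, delimiter='\t'), 1, None):  # skip the header
    year_counter[int(row[1])] += 1  # row[1] is the Year column

first_year = min(year_counter.keys())
last_year = max(year_counter.keys())
print(first_year, last_year)

# Print the number of movies per year in chronological order.
for year in sorted(year_counter):
    print(year, year_counter[year])
```

Converting the year to int keeps min() and max() numeric; comparing the raw strings would also work for four-digit years, but only by coincidence.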

Code for Q1



In [30]:

    
import pandas as pd
imdb = pd.read_csv('imdb.csv', delimiter='\t')



In [31]:

    
imdb.head()









    Out[31]:

   Title        Year  Rating  Votes
0  !Next?       1994  5.4     5
1  #1 Single    2006  6.1     61
2  #7DaysLater  2013  7.1     14
3  #Bikerlive   2014  6.8     11
4  #ByMySide    2012  5.5     13



In [34]:

    
min(imdb['Year'])









    Out[34]:





1874



In [35]:

    
max(imdb['Year'])









    Out[35]:





2017



In [48]:

    
from collections import Counter
Counter(imdb["Year"])









    Out[48]:





Counter({1874: 1,
         1878: 1,
         1887: 1,
         1888: 5,
         1889: 2,
         1890: 5,
         1891: 9,
         1892: 9,
         1893: 2,
         1894: 94,
         1895: 116,
         1896: 678,
         1897: 479,
         1898: 321,
         1899: 242,
         1900: 265,
         1901: 254,
         1902: 217,
         1903: 261,
         1904: 214,
         1905: 177,
         1906: 182,
         1907: 197,
         1908: 267,
         1909: 405,
         1910: 389,
         1911: 309,
         1912: 376,
         1913: 311,
         1914: 315,
         1915: 361,
         1916: 328,
         1917: 317,
         1918: 286,
         1919: 313,
         1920: 323,
         1921: 345,
         1922: 328,
         1923: 393,
         1924: 466,
         1925: 508,
         1926: 554,
         1927: 581,
         1928: 609,
         1929: 671,
         1930: 836,
         1931: 939,
         1932: 1026,
         1933: 1024,
         1934: 1120,
         1935: 1174,
         1936: 1235,
         1937: 1245,
         1938: 1230,
         1939: 1162,
         1940: 1160,
         1941: 1169,
         1942: 1193,
         1943: 1105,
         1944: 969,
         1945: 876,
         1946: 952,
         1947: 1010,
         1948: 1084,
         1949: 1208,
         1950: 1283,
         1951: 1318,
         1952: 1316,
         1953: 1393,
         1954: 1397,
         1955: 1476,
         1956: 1479,
         1957: 1604,
         1958: 1533,
         1959: 1572,
         1960: 1567,
         1961: 1623,
         1962: 1669,
         1963: 1635,
         1964: 1823,
         1965: 1896,
         1966: 2025,
         1967: 2086,
         1968: 2199,
         1969: 2320,
         1970: 2240,
         1971: 2370,
         1972: 2445,
         1973: 2325,
         1974: 2392,
         1975: 2286,
         1976: 2399,
         1977: 2264,
         1978: 2386,
         1979: 2526,
         1980: 2438,
         1981: 2485,
         1982: 2537,
         1983: 2647,
         1984: 2779,
         1985: 2908,
         1986: 2882,
         1987: 3049,
         1988: 3054,
         1989: 3193,
         1990: 3093,
         1991: 2993,
         1992: 3136,
         1993: 3128,
         1994: 3415,
         1995: 3698,
         1996: 3923,
         1997: 4353,
         1998: 4651,
         1999: 5138,
         2000: 5575,
         2001: 6042,
         2002: 6694,
         2003: 7355,
         2004: 8584,
         2005: 9508,
         2006: 10115,
         2007: 10147,
         2008: 11095,
         2009: 12268,
         2010: 12931,
         2011: 13944,
         2012: 13887,
         2013: 13048,
         2014: 10862,
         2015: 4402,
         2016: 2,
         2017: 1})

Q2: What are the average rating and the average number of votes?

We can store the ratings/votes column as a list and then calculate various basic statistics (mean, median, etc.). To do this, we can use the NumPy library and call the functions numpy.mean and numpy.median. For example,



In [10]:

    
import numpy as np

alist = [1,3,6,2,5,2]
print(np.mean(alist))
print(np.median(alist))
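As a rough sketch of the "store the column as a list" step, the snippet below collects the Rating and Votes columns from a hypothetical in-memory sample (`sample`) and summarizes them. Note that it swaps in the standard-library statistics module instead of NumPy so that it has no third-party dependency; np.mean and np.median could be called on the same lists.

```python
import csv
import io
import statistics
from itertools import islice

# Hypothetical sample rows taken from the dataset; not the full file.
sample = io.StringIO(
    "Title\tYear\tRating\tVotes\n"
    "!Next?\t1994\t5.4\t5\n"
    "#1 Single\t2006\t6.1\t61\n"
    "#7DaysLater\t2013\t7.1\t14\n"
)

ratings = []
votes = []
for row in islice(csv.reader(sample, delimiter='\t'), 1, None):  # skip the header
    ratings.append(float(row[2]))  # Rating column
    votes.append(int(row[3]))      # Votes column

mean_rating = statistics.mean(ratings)
median_rating = statistics.median(ratings)
mean_votes = statistics.mean(votes)
median_votes = statistics.median(votes)

print(mean_rating, median_rating)
print(mean_votes, median_votes)
```

Comparing the mean with the median is a quick way to see how skewed a column is; here a single movie with 61 votes already pulls the mean well above the median.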

Code for Q2



In [41]:

    
# implement below
imdb['Rating'].mean()









    Out[41]:





6.2961953413777811



In [42]:

    
imdb['Votes'].mean()









    Out[42]:





1691.2317746021706

Q3: What are the 5 movies that have the highest ratings/votes?

Store the movie titles and ratings information as a dictionary:

key: movie title
value: movie rating

Then, we can sort the dictionary by its values, which returns a list of tuples. Remember to print only the top 5 movies.



In [12]:

    
import operator

dt = {1971: 2, 1975: 10, 1962: 1, 1980: 50, 1981: 55}
sorted_x_by_val = sorted(dt.items(), key=operator.itemgetter(1), reverse=True )
print(sorted_x_by_val)
for elem in sorted_x_by_val:
    print(elem[0],elem[1])









    



[(1981, 55), (1980, 50), (1975, 10), (1971, 2), (1962, 1)]
1981 55
1980 50
1975 10
1971 2
1962 1
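The same sorting pattern can be limited to the top 5 by slicing the sorted list. The sketch below assumes a small hypothetical dictionary (`votes_by_title`) mapping titles to vote counts, with values taken from the dataset.

```python
import operator

# Hypothetical title -> votes mapping; a real one would hold every movie in the file.
votes_by_title = {
    'Inception': 1285905,
    'Fight Club': 1189053,
    'The Shawshank Redemption': 1511933,
    '#1 Single': 61,
    'Pulp Fiction': 1177471,
    'The Dark Knight': 1487023,
    '!Next?': 5,
}

# Sort the (title, votes) pairs by votes, largest first, then keep the first five.
ranked = sorted(votes_by_title.items(), key=operator.itemgetter(1), reverse=True)
top5 = ranked[:5]

for title, votes in top5:
    print(title, votes)
```

One caveat of keying by title: if two movies share a title, the dictionary keeps only the one read last.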

Code for Q3



In [45]:

    
# implement below
# sort_values replaces the deprecated sort_index(by=...); movies with tied ratings may appear in a different order.
imdb.sort_values(by='Rating', ascending=False).head()









    Out[45]:

        Title                                  Year  Rating  Votes
57863   Adolfo Perez Esquivel: Rivers of Hope  2015  9.9     9
42123   The Red Shirt Diaries                  2014  9.8     6
140553  High-Rise                              2015  9.8     5
131241  Girls Loving Girls                     1996  9.8     5
24902   Mari White Presents the Newsboys       2011  9.7     6



In [47]:

    
# sort_values replaces the deprecated sort_index(by=...).
imdb.sort_values(by='Votes', ascending=False).head()









    Out[47]:

        Title                     Year  Rating  Votes
279320  The Shawshank Redemption  1994  9.3     1511933
264590  The Dark Knight           2008  9.0     1487023
149895  Inception                 2010  8.8     1285905
122656  Fight Club                1999  8.9     1189053
223981  Pulp Fiction              1994  8.9     1177471

Name the .ipynb file 'lab02_lastname_firstname' and upload it to Canvas under [w2] lab assignment.

Part 2. html and css

1. Set up a local web server

Many browsers don't allow web pages to load local files due to security concerns. We can get around this by creating a local web server with Python as follows:

Open the ‘Command Prompt’.
Move to the folder where you keep your lab materials by typing ‘cd FOLDER_LOCATION’. We will use this folder as the ‘root’ for our web server.
Then type ‘python -m http.server’. (This is the Python 3 command; on Python 2 the equivalent is ‘python -m SimpleHTTPServer’.)

If successful, you'll see

Serving HTTP on 0.0.0.0 port 8000 …

This means that your computer is now running a web server on port 8000; the address 0.0.0.0 means that it accepts connections on all of the computer's network interfaces. Open a browser and type "localhost:8000" in the address bar to connect to this web server (on some systems "0.0.0.0:8000" works as well). If the folder has no index.html, the page lists the files in the folder; click on the different links to open them. You can also access one of these files directly by typing ‘localhost:8000/NAME_OF_YOUR_FILE.html’ in the address bar.
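To see what the command does, the following minimal sketch starts the same kind of server from a Python script and then requests the root page, playing the role of the browser. Unlike the command line, it binds to 127.0.0.1 only and lets the operating system pick a free port; both choices are assumptions made to keep the example self-contained.

```python
import threading
import urllib.request
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Serve the current folder. Port 0 asks the OS for any free port
# (the command-line server uses port 8000 by default).
server = HTTPServer(('127.0.0.1', 0), SimpleHTTPRequestHandler)
port = server.server_address[1]

# Handle requests in a background thread so this script can act as the client.
thread = threading.Thread(target=server.serve_forever)
thread.daemon = True
thread.start()

# An opener with an empty ProxyHandler ignores any system proxy settings.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
response = opener.open('http://127.0.0.1:{}/'.format(port), timeout=5)
status = response.getcode()
response.close()
print('Server answered with HTTP status', status)

# Stop the server and release the port.
server.shutdown()
server.server_close()
```

A status of 200 means the request succeeded, which is what the browser receives before it renders the file listing or your HTML page.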

2. html review

Webpages are written in a standard markup language called HTML (HyperText Markup Language). The basic syntax of HTML consists of tags, which are element names enclosed within ‘<’ and ‘>’ symbols. Browsers such as Firefox and Chrome parse these tags and display the content of a webpage in the designated format. This is called rendering.

Here is a list of important tags and their descriptions.

html - Surrounds the entire document.
head - Contains info about the document itself. E.g. the title, any external stylesheets or scripts, etc.
title - Assigns a title to the page. This title is used when bookmarking.
body - The main part of the document.
h1, h2, h3, h4, h5, h6 - Headings (the smaller the number, the larger the size).
p - Paragraph.
br - Line break.
em - Emphasized text.
strong or b - Bold font.
a - Defines a hyperlink and allows you to link out to other webpages.
img - Places an image.
ul, ol, li - Unordered lists with bullets, ordered lists with numbers, and each item in a list, respectively.
table, th, td, tr - Make a table, specifying the contents of each cell.
<!-- ... --> - Comments; will not be displayed.
span - This will not visibly change anything on the webpage, but it is important when referencing from CSS or JavaScript. It spans a section of text, say, within a paragraph.
div - This will not visibly change anything on the webpage, but it is important when referencing from CSS or JavaScript. It stands for division and allocates a section of a page.

Using the five most-voted movies found in Part 1, try the following:

Create a table with the following columns: Movie Title, Year, Rating, Votes.
Link each movie title to its IMDb page.
Add a title for the table. Can you change its font and set it to bold?
Change the background color of the page.
Add an entry of your favorite movie to the table. Can you set the text to a different color to highlight it?

Test your code by visiting the web page on your local server. Name the .html file 'lab02_html_lastname_firstname' and upload it to Canvas.
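If you would rather not type the table rows by hand, one option is to generate the markup from the Part 1 results with Python. The sketch below is only an illustration: the list `movies` is typed in from the output above, and href="#" is a placeholder that you would replace with each movie's IMDb page.

```python
import html

# The five most-voted movies from Part 1: (title, year, rating, votes).
movies = [
    ('The Shawshank Redemption', 1994, 9.3, 1511933),
    ('The Dark Knight', 2008, 9.0, 1487023),
    ('Inception', 2010, 8.8, 1285905),
    ('Fight Club', 1999, 8.9, 1189053),
    ('Pulp Fiction', 1994, 8.9, 1177471),
]

lines = [
    '<table>',
    '  <tr><th>Movie Title</th><th>Year</th><th>Rating</th><th>Votes</th></tr>',
]
for title, year, rating, votes in movies:
    # html.escape protects characters such as & and < in titles.
    # href="#" is a placeholder link, not a real IMDb URL.
    lines.append(
        '  <tr><td><a href="#">{}</a></td><td>{}</td><td>{}</td><td>{}</td></tr>'
        .format(html.escape(title), year, rating, votes)
    )
lines.append('</table>')

table_html = '\n'.join(lines)
print(table_html)
```

The printed text can be pasted inside the body of your .html file.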

3. CSS review

While HTML directly deals with the content and structure, CSS (Cascading Style Sheets) is the primary language that is used for the look and formatting of a web document.

A CSS stylesheet consists of one or more selectors, properties and values. For example:

body {   
    background-color: white;   
    color: steelblue;   
}

Selectors identify the HTML elements to which the specific styles (combinations of properties and values) will be applied. In the above example, all text within the ‘body’ tags will be steelblue on a white background.

There are three ways to include CSS code in HTML. This is called ‘referencing’.

Embed CSS in HTML - You can place the CSS code within ‘style’ tags inside the ‘head’ tags. This way you can keep everything within a single HTML file, but it makes the file longer.

<head>
  <style type="text/css">
    .description {
      font: 16px "Times New Roman", serif;
    }
    .viz {
      font: 10px sans-serif;
    }
  </style>
</head>

Reference an external stylesheet from HTML - This is a much cleaner way but results in the creation of another file. To do this, you can copy the CSS code into a text file and save it as a ‘.css’ file in the same folder as the HTML file. In the document head in the HTML code, you can then do the following:

<head>
  <link rel="stylesheet" href="stylesheet.css">
</head>

Attach inline styles - You can also directly attach the styles in-line along with the main HTML code in the body. This makes it easy to customize specific elements but makes the code very messy - the design and content get mixed up.

<p style="color: green; font-size: 36px; font-weight: bold;">
  Inline styles can help when using D3.
</p>

Can you redo tasks 3-5 from the previous section using only CSS? Name the .ipynb file 'lab02_css_lastname_firstname' and upload it to Canvas.


